Murrinh-Patha Complex Verbs: Syntactic Theory and Computational Implementation

Authors

Melanie Seiss

Miriam Butt

Rachel Nordlinger

Abstract

away from the surface realization. In order to achieve such a morphological analysis, a morphological analyzer needs to be implemented which can handle the long-distance dependencies and the phonological rules that apply. Such a morphological analyzer was implemented using the Xerox finite-state technology tools xfst and lexc (Beesley & Karttunen 2003). This system was chosen because it offers a wide range of inbuilt mechanisms for handling complex cases such as those found in the Murrinh-Patha verb. This section uses selected examples to show how these inbuilt mechanisms facilitate the modeling of the complexities. The following section then describes how this implementation is used to construct a robust morphological analyzer that can be successfully applied to real text.

Before the details of the implementation of the morphological analyzer are discussed, though, a brief overview of related work is provided, i.e., of rule-based morphological analyzers that use finite-state methods for more or less morphologically complex languages.

The concept of finite-state morphology was developed in the 1980s as a tool for the computational morphological analysis of natural language. Beesley & Karttunen (2005) provide a detailed historical overview and a formal description of the formalisms. Since Koskenniemi (1983) combined ideas of sequenced phonological rewrite rules with two-level morphology and proposed a first computational implementation of Finnish morphology, finite-state morphologies have been developed for a number of diverse languages, language families and language types. Examples include the treatment of South Asian languages such as Urdu (Bögel et al. 2007) and Kannada (Veerappan et al. 2011), the treatment of Indonesian by Larasati et al. (2011) as a representative of the Austronesian languages, and the treatments of Estonian (Uibo 2005) and Finnish (e.g., Lindén & Pirinen 2009) from the Uralic language family.
Semitic languages have traditionally been a focus of finite-state methods for the treatment of their morphology because of the special challenges that root-and-pattern morphology poses. Examples of treatments of Semitic languages are Beesley (1996) and Attia et al. (2011) for Arabic as well as Yona & Wintner (2008) for Hebrew. Karttunen (2003) presents an xfst implementation of the Bantu language Lingala in which he shows that even analyses cast within paradigm function morphology (Stump 2001) can be implemented with xfst’s regular expressions.

While all these approaches face different challenges posed by the respective morphology of the language, to my knowledge no finite-state implementation has so far been proposed for an Australian language with a templatic verbal morphology as complex as that of Murrinh-Patha. Sproat & Brunson (1987) present a morphological analyzer of Warlpiri, focusing on the problem of reduplication. They make heavy use of prosodic information. Their implementation does not use a two-level morphology and is also not finite-state. As such, the basic layout of the analyzer differs considerably from the Murrinh-Patha analyzer presented here and from the other finite-state analyzers mentioned above.

A computational implementation of the morphology

Closest to the implementation of Murrinh-Patha described here are probably the treatment of Basque by Alegria et al. (1996) and the treatment of Persian by Megerdoomian (2004). The implementation of the Persian morphology proposed by Megerdoomian (2004) is concerned with problems involving tokenization, phonological rules and long-distance dependencies. However, it seems that the long-distance dependencies for Persian, which are modeled with flag diacritics (see below), are simpler than the ones found in Murrinh-Patha. Alegria et al. (1996), in their implementation for Basque, propose a formalism for long-distance dependencies, but treat only simple long-distance dependencies.
Their implementation is, however, relevant because it proposes a three-step lookup strategy similar to the lookup strategy proposed in the following section.

Most recent projects use either xfst (Beesley & Karttunen 2003) or the open-source alternatives Foma (Hulden 2009) or hfst (Helsinki Finite-State Transducer, Lindén et al. 2011). xfst is, for example, used in the implementations by Yona & Wintner (2008) for Hebrew and by Bögel et al. (2007) for Urdu. On the other hand, Attia et al. (2011) and Larasati et al. (2011), for example, explicitly state that they use Foma because of licensing issues. For the implementation of Murrinh-Patha that is described in this chapter, the Xerox finite-state technology tools xfst and lexc (Beesley & Karttunen 2003) have been used, as the combination of both programming languages provides powerful tools for implementing morphological complexities.

The goal of the computational implementation is to associate a surface form of a word with its morphological analysis. An example is provided in (5.10), in which the Murrinh-Patha verb bamkardu is associated with its analysis: the classifier stem see(13) (+class13) in its 3rd person singular non-future form, a 3rd person singular direct object and a lexical stem ngkardu.

(5.10)
  bam+class13+3P+sg+nFut+3sgDO+ngkardu+LS : bamkardu

For such a pair, the terms upper level and lower level are used. The morphological analysis (bam+class13+3P+sg+nFut+3sgDO+ngkardu+LS) is the upper level while the surface form bamkardu is the lower level. The computational implementation is bidirectional, i.e., if presented with the analysis as input, it can return the surface form as output and vice versa.

To model these levels, different mechanisms offered by xfst and lexc are used. The concatenation of morphemes is implemented with lexc (Beesley & Karttunen 2003) as two-level networks. lexc uses continuation classes, which are implemented as so-called lexicons.
A first simple example is provided in (5.11). The first lexicon is called ROOT; it comprises all possible first morphemes of a word. In this lexicon, the classifier stem bam is associated with the morphological information that it carries (bam+class13+3P+sg+nFut). The right side of the entry specifies which lexicon is used next. In (5.11), bam can be concatenated with objects from the lexicon SLOT2, which in this case only contains the 3rd person direct object marking, which is not overtly realized (noted as 0). This combination combines with items from the lexicon LEX, which contains the lexical stem ngkardu. The hash key marks the end of a word.

(5.11)
  Lexicon ROOT
  bam+class13+3P+sg+nFut:bam SLOT2;

  Lexicon SLOT2
  +3sgDO:0 LEX;

  Lexicon LEX
  +ngkardu+LS:ngkardu #;

In this way, the network in (5.11) can produce the output in (5.12). The string bamngkardu is associated with the information on the upper side, i.e., that this word is made up of a classifier stem bam, which is classifier stem 13 inflected for third person singular non-future tense, a zero direct object marker and a lexical stem ngkardu.

(5.12)
  bam+class13+3P+sg+nFut+3sgDO+ngkardu+LS : bamngkardu

In the actual implementation of Murrinh-Patha verbs, the lexicon ROOT contains all forms of the 38 classifier stems, the lexicon SLOT2 contains all different object morphemes (among others), and the lexicon LEX contains a large number of different lexical stems. The other template slots are implemented with the help of lexicons in a similar way.

However, as was discussed above, the surface forms of Murrinh-Patha verbs are often more than pure concatenations of morphemes because of the phonological rules that apply when morphemes are combined. xfst offers the possibility of formulating phonological rules which rewrite the concatenation of morphemes to the actual surface form of the verb (Kaplan & Kay 1994).
For example, (5.13) accounts for the data in (5.2), in which /ng/ is lost if it follows an /m/ or /n/. Thus, the actual network contains the surface form and associated information as specified in (5.10).

(5.13)
  [ n g k -> k || m _ , n _ ]

For more complex cases in which the application of the rule depends on the lexical stem, as in (5.3) above, the lexical stem can be marked when it is concatenated with the other morphemes, and the regular expression can take this marking into account. Thus, the output of the concatenation of morphemes for the data in (5.3) is as illustrated in (5.14). In (5.14a), the lexical stem yel does not trigger any phonological change. In contrast, the lexical stem yerr in (5.14b) triggers a change and is therefore specially marked with a capital letter /Y/. The application of the phonological rules in (5.15) then causes /Y/ to be changed to /nth/ after /m/ (the sequence /mY/ is rewritten as /nth/) and to /y/ in all other cases, and so ensures the right surface form. The final network for these two examples associates the morphological information with the surface forms kanamyel and mintherr, as is displayed in (5.16).

(5.14)
  a. kanam+class4+nFut+3P+sg+nFut+yel+LS : kanamyel
  b. mim+class12+nFut+1P+sg+nFut+yerr+LS : mimYerr

(5.15)
  [ m Y -> n t h ] .o. [ Y -> y ]

(5.16)
  a. kanam+class4+nFut+3P+sg+nFut+yel+LS : kanamyel
  b. mim+class12+nFut+1P+sg+nFut+yerr+LS : mintherr

The two-level approach makes it possible to distinguish between the surface form and the associated information. This is especially helpful for Murrinh-Patha, as the same classifier and lexical stem combination might have very different surface forms because of the large number of different forms of the classifier stems, the high number of morphemes in the verbal template and the phonological rules that apply. The two-level morphology allows one to abstract away from this surface form. This makes it possible, for example, to carry out a statistical analysis of texts independently of the surface form.
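The interplay of continuation classes and rewrite rules described above can be imitated in plain Python. The following is only a minimal sketch, not the xfst/lexc implementation itself: the dictionary `LEXICONS` and the functions `generate` and `apply_rules` are hypothetical names introduced for illustration, and ordinary string replacement stands in for the compiled finite-state rules.

```python
import re

# Toy stand-in for the lexc lexicons in (5.11): each entry is
# (upper side, lower side, continuation class); "#" ends the word.
LEXICONS = {
    "ROOT": [("bam+class13+3P+sg+nFut", "bam", "SLOT2")],
    "SLOT2": [("+3sgDO", "", "LEX")],  # zero-realized object marker
    "LEX": [("+ngkardu+LS", "ngkardu", "#")],
}


def generate(lexicon="ROOT", upper="", lower=""):
    """Yield every (upper, lower) pair reachable by following continuation classes."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, nxt in LEXICONS[lexicon]:
        yield from generate(nxt, upper + up, lower + low)


def apply_rules(intermediate):
    """Apply the rules of (5.13) and (5.15) in sequence to a concatenated form."""
    s = re.sub(r"(?<=[mn])ngk", "k", intermediate)  # (5.13): ngk -> k after m or n
    s = s.replace("mY", "nth")                      # (5.15), first rule
    s = s.replace("Y", "y")                         # (5.15), second rule
    return s


# Pair each analysis with its final surface form.
network = {upper: apply_rules(lower) for upper, lower in generate()}
print(network)
```

Under these assumptions, the concatenated form bamngkardu of (5.12) is rewritten to the surface form bamkardu of (5.10), and `apply_rules("mimYerr")` yields mintherr as in (5.16b).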
Murrinh-Patha is, however, challenging for a computational morphological implementation because of the long-distance dependencies found in the verbal template. Dependencies between neighboring lexicons can easily be modeled by specifying different continuation classes, i.e., entries of one lexicon do not all have to lead to the same next lexicon. However, most dependencies in the Murrinh-Patha verbal template are long-distance, which is very difficult to model with continuation classes alone. xfst offers a way of modeling these long-distance dependencies with the help of flag diacritics.

Flag diacritics are special entities in xfst which add a kind of “short term memory” to keep track of what choices have been made before. As Beesley & Karttunen (2003:341) explain, normally, “the transition from one state to the next depends only on the current state and the next input symbol”. Using flag diacritics allows one to keep track of choices made earlier, so that certain transitions can also be constrained by those earlier choices.

In the implementation, flag diacritics can be recognized by the two surrounding @-symbols. After the first @-symbol, an operator is followed by a feature and a value, each separated by periods. Different operators exist, e.g., U(nification), P(ositive) setting, R(equire) test and D(isallow) test. The names of the features and values can be chosen arbitrarily, but for convenience, mostly morphological features and values have been chosen.

As a first simple illustration of the use of flag diacritics, the long-distance dependency between the tense marking on the classifier stem and the separate tense markers in slot 6 will be discussed. For all tenses but the non-future and future irrealis tense, tense markers in slot 6 are obligatory. The relevant examples are repeated in (5.17).
In example (5.17), bam is the non-future form of the classifier stem 13 while ba is the future form of the corresponding classifier. The future form has to combine with the future tense marker -nu (tagged as +Fut2), as can be seen in (5.17b); it is ungrammatical without -nu (5.17c). On the other hand, -nu cannot attach to the non-future classifier stem form (5.17d).

(5.17)
  a. bam +class13 +3P +sg +nFut +3sgDO +ngkardu +LS : bam-ngkardu
  b. ba +class13 +3P +sg +Fut +ngkardu +LS +Fut2 : ba-ngkardu-nu
  c. ba +class13 +3P +sg +Fut +ngkardu +LS : *ba-ngkardu
  d. bam +class13 +3P +sg +nFut +3sgDO +ngkardu +LS +Fut2 : *bam-ngkardu-nu

This interplay can be modeled with the help of P- and R-type flag diacritics. The P-type flag diacritics are used to set a value, e.g., for tense, to positive. In contrast, the R-type flag diacritics require a value to have been set to positive in order to allow the respective combination.

(5.18) is a fragment of the implementation. The lexicon ROOT lists the classifier stems. From this lexicon, the various classifier stem forms are sent to different lexicons to pick up their respective tense information. For example, bam carries non-future information and is consequently sent to the lexicon NFUT to pick up the flag diacritic “@P.Tense.nFut@”. This flag diacritic sets the value for the attribute ‘Tense’ to ‘nFut’, i.e., it remembers that bam is non-future tense. Like bam, other non-future classifier stem forms are also sent to the same lexicon NFUT. In contrast, classifier stem forms in future tense such as ba are sent to the lexicon FUT and receive the flag diacritic “@P.Tense.Fut@” to remember that their tense value is future. After the flag diacritics for tense information are picked up, other morphemes from the verbal template slots 2 to 5, e.g., direct and indirect object markers, incorporated body parts, lexical stems etc., can be attached. This is not represented in detail in (5.18) but indicated by the dots.
Flag diacritics are used instead of continuation classes in this case because many morphemes can intervene between the classifier stem inflected for tense and the tense markers in slot 6. Finally, when the corresponding tense markers are attached in the lexicon TENSE SLOT6, the choices for the combination are constrained by the R-type flag diacritics, which require a certain value for tense. The morpheme -nu can only attach to a future classifier stem form, i.e., this choice is marked with the flag diacritic “@R.Tense.Fut@” and consequently, only strings are possible which include the flag diacritic “@P.Tense.Fut@”. Similarly, the first line in the lexicon TENSE SLOT6 is marked by the flag diacritic “@R.Tense.nFut@”, which specifies that this choice, i.e., no morpheme attaching in this slot, is only possible if the value of the feature “Tense” has been set to “nFut” before.

(5.18)
  Lexicon ROOT
  bam NFUT;
  ba FUT;

  Lexicon NFUT
  @P.Tense.nFut@ LEX;

  Lexicon FUT
  @P.Tense.Fut@ LEX;

  . . .

  Lexicon TENSE SLOT6
  @R.Tense.nFut@ #;
  @R.Tense.Fut@:[email protected]@ #;

The dependencies between the classifier and lexical stems can be modeled in a similar way. They again need flag diacritics because the dependencies are long-distance, holding between the verbal template slots 1 and 5. (5.19) shows an excerpt from the lexc lexicons. The classifier stems stand(3) and hands(8) can (among others) combine with the lexical stem dharday ‘down’. To model the long-distance dependencies, flag diacritics are used to remember which classifier stem has been chosen. A separate flag diacritic is used for each classifier stem. This also means that the lexical stem has to be listed multiple times, i.e., once for each combination with a classifier stem.[4]

(5.19)
  Lexicon ROOT
  ngirra @P.CLASS.3@ : ngirra +class3 @P.CLASS.3@ FUT
  mam @P.CLASS.8@ : mam +class8 @P.CLASS.8@ NFUT
  ...

  Lexicon Lexical Stems
  dharday @R.CLASS.3@ : dharday +LS @R.CLASS.3@ SLOT6
  dharday @R.CLASS.8@ : dharday +LS @R.CLASS.8@ SLOT6

[4] Seiss (2011) argues for modeling the combinations of classifier and lexical stems in the sublexical entries of xle. While this is desirable from a theoretical perspective, as the restrictions on the combinations are then bound to argument structure information, from a computational implementation perspective, modeling the dependencies in the xfst morphology already has a range of advantages: the morphology can then be used as a stand-alone application, for example in a corpus study, as is described in Section 5.3.

These dependencies, i.e., the dependencies for tense markers and the dependencies between classifier and lexical stems, are quite simple examples of long-distance dependencies. However, flag diacritics also allow the modeling of complex long-distance dependencies such as the subject number and object marker dependencies, which hold between three different verbal template slots: they affect slot 1 for the classifier stem as well as slots 2 and 8 for the object and number markers.

As was discussed above, the subject number markers and the object markers compete for the same slots, i.e., for slots 2 and 8. For example, the dual subject number marker has to attach in slot 2 if no object marker is present. If an object marker is present, the object marker has to attach in slot 2 and the subject number marker can only be realized in slot 8. This was illustrated with the examples in (5.7). These facts concerning the singular classifier stem can be modeled with the help of flag diacritics as displayed in the implementation fragment in (5.20).

(5.20)
  Lexicon ROOT
  [email protected]@..:[email protected]@ SLOT2;

  Lexicon SLOT2
  @P.SMark.no@ RR;
  +1sgDO:ngi RR;
  [email protected]@@R.Num.sg@ :[email protected]@@R.Num.sg@ RR;
  ... ...

  Lexicon SLOT8
  [email protected]@@D.SMark.no@@R.Num.sg@ :[email protected]@@D.SMark.no@@R.Num.sg@ #;

In the lexicon ROOT, the classifier stem form bam is associated with the singular form of classifier 13, and this choice is marked by the P-type flag diacritic, i.e., it remembers that the value of the number feature has been set to singular.

In the lexicon SLOT2, three different choices are possible. In the first case, nothing is attached. This is, for example, the case for intransitive verbs with singular subjects. However, the system has to remember that nothing has been attached in this slot, which is implemented with the flag diacritic @P.SMark.no@. Alternatively, an overtly expressed object marker can attach in slot 2, e.g., the first person singular direct object marker -ngi. As a third choice, the dual male non-sibling subject number marker -nintha can attach in slot 2. However, -nintha can only attach if the classifier stem form is singular, which is modeled by the flag diacritic @R.Num.sg@, which requires the value of the number feature to have been set to singular before. In this case, the flag diacritic @P.SMark.pres@ tells the system to remember that the dual subject marker is present in slot 2.

The lexicon SLOT8 then takes care of all possible choices. First, the dual subject number marker can only attach in slot 8 if it is not present in slot 2. This dependency is modeled by the flag diacritic @D.SMark.pres@, which disallows this choice if the value of the feature SMark has been set to “pres(ent)” before. Secondly, the dual number marker can only attach in slot 8 if slot 2 is not empty, i.e., this choice is disallowed if the value of SMark has been set to “no” before. And thirdly, as has already been discussed, the classifier stem has to be in its singular form.
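The behavior of the P-, R- and D-type flag diacritics can be made concrete with a small Python sketch. It is an illustrative simulation only: `path_is_valid` is a hypothetical helper, not part of xfst, and it checks one path through a network given as a list of symbols. The flag names follow the fragments above; the slot 2 and slot 8 symbol sequences are toy inputs for the checker rather than attested word forms.

```python
import re

# A flag diacritic has the shape @OPERATOR.FEATURE.VALUE@.
FLAG = re.compile(r"@([PRD])\.(\w+)\.(\w+)@")


def path_is_valid(symbols):
    """Return True if no R- or D-type test fails along the symbol sequence."""
    memory = {}  # feature -> value set by earlier P-type flags
    for sym in symbols:
        match = FLAG.fullmatch(sym)
        if match is None:
            continue  # ordinary morpheme symbol, no constraint
        op, feature, value = match.groups()
        if op == "P":
            memory[feature] = value
        elif op == "R" and memory.get(feature) != value:
            return False  # required value was never set
        elif op == "D" and memory.get(feature) == value:
            return False  # disallowed value was set
    return True


# Tense dependency as in (5.17b) and (5.17d).
future_ok = ["ba", "@P.Tense.Fut@", "ngkardu", "@R.Tense.Fut@", "nu"]
nonfuture_bad = ["bam", "@P.Tense.nFut@", "ngkardu", "@R.Tense.Fut@", "nu"]

# Slot 8 number marker: blocked when slot 2 was left empty (SMark = no).
slot8_tests = ["@D.SMark.pres@", "@D.SMark.no@", "@R.Num.sg@", "nintha"]
empty_slot2 = ["bam", "@P.Num.sg@", "@P.SMark.no@", "ngkardu"] + slot8_tests
object_in_slot2 = ["bam", "@P.Num.sg@", "ngi", "ngkardu"] + slot8_tests

print(path_is_valid(future_ok), path_is_valid(nonfuture_bad))
print(path_is_valid(empty_slot2), path_is_valid(object_in_slot2))
```

The sketch covers only the three operators used in the fragments; the U(nification) operator and the way flags are eliminated from a compiled network are left out.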
The actual implementation is, however, even more complicated, as subject and object number morphemes also compete for slot 8, as has been discussed above for example (5.8). That is, nintha/ngintha can also mark object number if a dual marker such as nganku is present in slot 2. This can likewise be modeled with flag diacritics, as can be seen in the more detailed implementation fragment in (5.21).

(5.21)
  Lexicon ROOT
  [email protected]@..:[email protected]@ SLOT2;

  Lexicon SLOT2
  @P.SMark.no@ RR;
  +1sgDO:ngi RR;
  [email protected]@@R.Num.sg@ :[email protected]@@R.Num.sg@ RR;
  [email protected]@:[email protected]@ RR;
  ... ...

  Lexicon SLOT8
  [email protected]@@D.SMark.no@@R.Num.sg@ :[email protected]@@D.SMark.no@@R.Num.sg@ #;
  [email protected]@ :[email protected]@ #;
  [email protected]@ :[email protected]@ ZEROSMARK;

  Lexicon ZEROSMARK
  [email protected]@ :@R.Num.sg@ #;

If a dual object marker such as nganku is attached in SLOT2, it is marked with the flag diacritic @P.DOMark.du@. In the lexicon SLOT8, the interpretation of nintha as an object marker is only allowed if a dual object marker has been attached in SLOT2. This is implemented with the flag @R.DOMark.du@. In this case, then, the dual subject marker does not have a place to attach overtly in the verbal template and the example is ambiguous. This is modeled with the addition of another lexicon ZEROSMARK, in which the tag +du.m.Nsibl.S is attached if the classifier stem is in singular number (@R.Num.sg@).[5]

The advantage of modeling these interdependencies in as much detail as possible is that the resulting network represents only the valid combinations of verbal morphemes and does not overgenerate. This is especially important because of the large degree of syncretism which, without a detailed modeling of the dependencies, would lead to many different analyses for one item.
The inbuilt mechanisms provided by lexc and xfst thus allow a reliable model of the details of the complexities of the Murrinh-Patha verb. The following section describes how these mechanisms can be put to use in a robust morphological analyzer with a stepwise lookup strategy.

5.3 Developing a robust morphological analyzer with a stepwise lookup strategy

The previous section described the details that are needed when implementing such morphologically complex languages and the tools that xfst and lexc offer for these details. This section now discusses how these implementations can be put to use in a robust morphological analyzer in a corpus study of Murrinh-Patha. For this aim, a stepwise lookup strategy is used that relaxes the constraints of the verbal template step by step and also includes a morphological guesser. In this way, both high coverage and high precision can be achieved.

The implementation of a morphological analyzer usually starts out with the collection of the relevant facts. In the case of the Murrinh-Patha morphological analyzer, the description of the verbal template and the inflection of the other parts of speech were studied. For this purpose, Street (1989) and Joe Blythe’s Toolbox dictionary[6] were mainly used as a database for lexical items, together with Nordlinger (2010c) for the description of the verbal template and Blythe (2009a) for nominal inflection. As such, the computational implementation profits from the rather substantial theoretical description existing for Murrinh-Patha.

However, these resources are limited nonetheless. Especially for the bigger syntactic categories such as nouns and lexical stems, it has to be assumed that the database cannot cover all existing lexical items.

[5] The object marker nganku can also combine with ngime/nime to mark paucal non-sibling objects. This is modeled in the same manner as the dual interpretation.

[6] http://www.sil.org/computing/toolbox/
For example, approximately 1040 different lexical stems and 1250 different nouns were listed in the database. It can be expected that more lexical items exist for these categories. For this reason it makes sense to test the morphological analyzer on a larger collection of real text instead of on selected test sentences or words. In this way, unexpected lexical items and constructions may be spotted, which can then be analyzed, tested in fieldwork and included in the implementation.

For the development of the Murrinh-Patha morphological analyzer, a small Bible corpus has been used. The translations of the Bible chapters comprise around 70800 words. In order to predict how well a computational analysis may work on unseen texts, it is common in computational linguistics to divide a corpus into a development corpus and a test corpus. The output for the development corpus is analyzed during the development of the morphological analyzer, and new insights that the output generates are implemented in the analyzer. Once the development of the morphological analyzer based on the development corpus is finished, the morphological analyzer is run over the test corpus. The results for the test corpus then indicate how well the morphological analyzer does on unknown text.

For the corpus study of Murrinh-Patha, the Bible corpus has been divided into a development corpus (approximately 85%) and a test corpus (approximately 15%). Thus, the development corpus consists of around 60080 words and the test corpus of around 10720 words. As these corpora comprise running text, many word forms of course occur multiple times in the corpus. This is especially the case for nominals and other lexical items which can be inflected for case and discourse marking, but which do not show the same high morphological complexity as verbs. The most extreme example is the conjunction i ‘and’, which occurs 2416 times in the whole corpus.
Running the morphological analyzer over the text as it is provides a morphological analysis for each word as often as it occurs in the corpus. However, the morphological analyzer can also be tested on word lists that are extracted from the running text. In these word lists, each different surface form only occurs once. That is, the conjunction i ‘and’ occurs only once in the word list and thus receives a morphological analysis only once. In the corpus study presented in this chapter, word lists were extracted both for the development and the test corpus. The development corpus rendered a word list of 5626 types and the test corpus a word list of 1643 types. Table 5.3 provides an overview of the various versions of the Bible corpus used in the development and evaluation of the morphological analyzer.

                        Running text    Word lists
  Development corpus    60080 words     5626 types
  Test corpus           10720 words     1643 types

Table 5.3: Overview of the different versions of the Bible corpus used in the development and evaluation of the morphological analyzer

The results for the word lists and for the running text complement each other. The results for the word lists show which surface forms can be given an analysis. They do not, however, distinguish between high and low frequency items. This distinction becomes apparent only if the results for the word list are compared with the results for the running text. If the results for the running text are better than the results for the word list, this means that the morphological analyzer does well on higher frequency items.

Such tests on running text and on word lists can be used to optimize the analyzer. For the development of the Murrinh-Patha morphological analyzer, the results of applying the analyzer to the word list and the running text of the development corpus were analyzed carefully.
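The difference between evaluating on a word list (types) and on running text (tokens) can be illustrated with a short Python sketch. The function `coverage` and the toy data are assumptions made for illustration; `xyz` is a placeholder for an unanalyzable word, not a corpus item.

```python
from collections import Counter


def coverage(tokens, is_analyzed):
    """Return (type coverage, token coverage) for a list of running-text tokens."""
    counts = Counter(tokens)  # the word list is the set of keys
    analyzed_types = sum(1 for word in counts if is_analyzed(word))
    analyzed_tokens = sum(n for word, n in counts.items() if is_analyzed(word))
    return analyzed_types / len(counts), analyzed_tokens / len(tokens)


# Toy running text: the frequent conjunction i, one verb, one unknown word.
tokens = ["i", "i", "i", "bamkardu", "xyz"]
known = {"i", "bamkardu"}

type_cov, token_cov = coverage(tokens, lambda word: word in known)
print(f"types: {type_cov:.2%}, tokens: {token_cov:.2%}")
```

In this toy case two of three types but four of five tokens are analyzed, so the token figure is higher: the pattern that indicates good performance on high frequency items.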
The development of the morphological analyzer started out with a close modeling of the well-documented facts of Murrinh-Patha morphology. This included the inflection of nominals with case and discourse markers as well as the implementation of the complex verbal template, which required the implementation of the various dependencies between the morphemes with flag diacritics as described in the previous section.

This morphological analyzer was then applied to the development corpus described above. The interpretation of the output led to the addition of new lexical items to the analyzer. However, it should be noted that the output of the morphological analyzer at the development stage could not be checked in fieldwork, and the translation that is provided in the Bible corpus is very free. Consequently, the interpretation of the output is quite difficult, and additions to the morphological analyzer have been made only very tentatively. For this reason, mainly higher frequency items, borrowed nouns and spelling alternatives whose meaning could be deduced from the translation were added.

For example, 20 Murrinh-Patha nouns were added. These comprise spelling alternatives and nouns which did not have a lexical entry in Blythe’s Toolbox dictionary or in Street (1989), but which could nonetheless be found in the resources, e.g., in example sentences for other lexical entries. Spelling alternatives have been added when the context confirmed that the same word was indeed meant. Some examples of spelling alternatives are kulututuk in the corpus, for which kurlurnturtuk ‘dove’ could be found in Street (1989), and purrkpurrk in the corpus, corresponding to purrpurrk ‘small and numerous’ in Street (1989).
For borrowed nouns, 201 new lexical items were added, especially names of people and places, animal names and a couple of words for concepts and objects which seem not to have a corresponding word in Murrinh-Patha, e.g., basket, olive, birthday etc.

  Lexical items                                    Number of entries
  all forms of 38 classifier stems
  incorporated body parts                          38
  incorporated adverbs / particles                 21
  lexical stems                                    1041
  classifier plus lexical stem combinations        2173

  nouns                                            1273
  noun class markers                               10
  pronouns                                         44
  borrowed nouns                                   201
  adjectives                                       175
  adverbs                                          44
  interjections                                    55
  interrogatives                                   23
  demonstratives                                   10
  non-conjugated verbs                             11
  numerals                                         8

Table 5.4: Number of lexical entries for the Murrinh-Patha morphological analyzer. Top: different morphemes of the verb; bottom: stems for other parts of speech.

After the inspection of the development corpus, the morphological analyzer comprises the lexical entries specified in Table 5.4. The lexicon contains all forms of the 38 classifier stems as well as all forms of the functional morphemes, i.e., number markers, tense markers etc. It comprises 1041 lexical stems, which include simple lexical stems as well as their lexicalized reduplicated versions. With these lexical stems, 2173 classifier plus lexical stem combinations can be formed, i.e., each lexical stem can combine with roughly two classifier stems. Nouns form the other large class, with 1271 different lexical items. Finally, adjectives, adverbs, interjections, interrogatives etc. have also been implemented in the morphological analyzer.

When this morphological analyzer, i.e., the analyzer with the additions to the lexicon, is applied to the word list of the development corpus, 82.55% of the words can be given an analysis. An inspection of the unanalyzable 17.45% shows that these comprise many verbs.
If verbs do not receive an analysis by the morphological analyzer, this may be due to two reasons: either the combination of classifier and lexical stems is unknown or the lexical stem is unknown. As can be seen in Table 5.4, 2173 classifier and lexical stem combinations, i.e., 2173 verbs, have been implemented in the morphological analyzer. It is very likely that more combinations of lexical and classifier stems exist and that these have not been listed in the database so far. Similarly, it is quite likely that other lexical stems exist which are not listed in the database so far. It is nevertheless desirable that these combinations receive a morphological analysis as well. As these new combinations have not been documented so far, however, they should be specially marked in the output of the morphological analyzer to distinguish them from the combinations that were known from the database. This can be accomplished by making use of a stepwise lookup strategy.

A stepwise lookup strategy can be used whenever one analysis is given priority over another analysis. For this purpose, a morphological analyzer uses different lookup strategies to analyze the input. The input is first analyzed by the first strategy of the morphological analyzer. If this strategy does not provide an analysis for the input, the second strategy is used, and so on. Such a stepwise lookup strategy is, for example, used in a morphological analyzer for Basque (Alegria et al. 1996). In this analyzer, the first lookup strategy analyzes the standard language. Only if the first lookup strategy does not provide an analysis is the input passed on to the second lookup strategy, which analyzes linguistic variants such as dialectal constructions. As a third lookup strategy, Alegria et al. (1996) propose an analysis for words that are not covered by the lexicon. For the Murrinh-Patha morphological analyzer, a similar stepwise lookup strategy is used.
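The control flow of such a stepwise lookup can be sketched generically in Python. This is a hedged illustration of the cascade only: `stepwise_lookup`, the strategy labels and the two toy strategies are hypothetical, and in a real system each strategy would be a separate finite-state network.

```python
def stepwise_lookup(word, strategies):
    """Try each (label, analyze) pair in order; return the first non-empty result."""
    for label, analyze in strategies:
        analyses = analyze(word)
        if analyses:
            return label, analyses  # later strategies are never consulted
    return None, []


# Toy first strategy: a table of fully known analyses, as in (5.10).
KNOWN = {"bamkardu": ["bam+class13+3P+sg+nFut+3sgDO+ngkardu+LS"]}


def strict(word):
    return KNOWN.get(word, [])


# Toy fallback: accepts any alphabetic string and marks the result as unknown.
def fallback(word):
    return [word + "+Unknown"] if word.isalpha() else []


STRATEGIES = [("known", strict), ("fallback", fallback)]

print(stepwise_lookup("bamkardu", STRATEGIES))
print(stepwise_lookup("xyz", STRATEGIES))
```

Returning the label together with the analyses keeps results from less restrictive strategies distinguishable from those based on documented material.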
The first strategy consists of the morphological analyzer as described so far. The second lookup strategy covers the cases of unknown combinations of classifier and lexical stems. This second lookup strategy only applies if the first strategy cannot provide an analysis. In this way, priority is given to already known classifier and lexical stem combinations. The second lookup strategy only generates candidates for new classifier and lexical stem combinations. Their meaning has to be established through additional analysis, for example by looking at the translation of the corpus or by additional fieldwork. The following section discusses the interpretation of the new classifier and lexical stem combinations in detail. For the implementation of this second lookup strategy, the same finite state network is used as for the first lookup strategy. The only difference is that the flag diacritics which model the dependencies between the classifier and lexical stems have been left out. This means that the network accepts any combination of classifier and lexical stem. With this second lookup strategy, 4.28% of the development corpus word list could be analyzed. As was pointed out above, Murrinh-Patha verbs could also be unanalyzable for the computational implementation because the lexical stem involved is not listed in the lexicon. As listed in Table 5.4, 1041 lexical stems could be extracted from the database. Lexical stems are the only large class of morphemes that constitute verbs. Classifier stems as well as the other morphemes, such as tense and number markers, all come from small, closed classes of morphemes. If a verb cannot be given an analysis with the first two lookup strategies, it is thus most likely that the lexical stem is unknown. As with the unknown classifier and lexical stem combinations discussed above, these verbs should receive a morphological analysis.
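The priority ordering between the first two lookup strategies can be sketched in a few lines of Python. This is only an illustration of the cascade logic, not the xfst/lexc implementation: the forms, tags and licensed pairs below are invented placeholders, and the +NewCombination marker is a hypothetical stand-in for whatever special marking the analyzer output uses.

```python
# Toy lexicon: all forms, tags and licensed pairs are invented placeholders,
# not entries from the actual Murrinh-Patha lexicon.
CLASSIFIERS = {"ba": "CLF_A", "ma": "CLF_B"}
STEMS = {"rta": "STEM_X", "purl": "STEM_Y"}
KNOWN_PAIRS = {("CLF_A", "STEM_X"), ("CLF_B", "STEM_Y")}


def segment(word):
    """Yield every (classifier tag, stem tag) split of word found in the toy lexicon."""
    for form, clf in CLASSIFIERS.items():
        rest = word[len(form):]
        if word.startswith(form) and rest in STEMS:
            yield clf, STEMS[rest]


def strict(word):
    """Strategy 1: accept only documented classifier + lexical stem combinations."""
    return [f"{c}+{s}" for c, s in segment(word) if (c, s) in KNOWN_PAIRS]


def relaxed(word):
    """Strategy 2: same lexicon, combination constraint dropped, output marked."""
    return [f"{c}+{s}+NewCombination" for c, s in segment(word)]


def lookup(word, strategies):
    """Return (strategy number, analyses) for the first strategy that succeeds."""
    for number, strategy in enumerate(strategies, start=1):
        analyses = strategy(word)
        if analyses:
            return number, analyses
    return None, []


print(lookup("barta", [strict, relaxed]))   # documented pair, found by strategy 1
print(lookup("bapurl", [strict, relaxed]))  # undocumented pair, found by strategy 2
```

Because the relaxed strategy is only consulted when the strict one returns nothing, a word with a documented combination never receives the additional, unverified candidates.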
For this purpose, a morphological guesser for lexical stems is incorporated into the system as a third lookup strategy. The morphological guesser enables the network to guess the form of a lexical stem without the lexical stem having to be listed in the lexicon. The guesser for Murrinh-Patha lexical stems has been implemented as described by Beesley & Karttunen (2003) for guessers in general. In a given verb, the classifier stem and the other morphemes, for example tense and number markers, are identified just as in a normal finite-state network. What is left once these morphemes are identified can then be considered the candidate for the lexical stem. (5.22) provides an example. perremka can be identified as the third person dual, non-future form of the classifier stem poke:rr(21), while -neme is the paucal male number marker. The guessed lexical stem in this case is wirnturt.[7]

(5.22) perremka-wirnturt-neme
       3duS.poke:rr(21).nfut-lexical.stem-pauc.m

In the implementation of the guesser, the lexc lexicon for lexical stems includes a placeholder entry which is then replaced in the network by all phonologically possible lexical stems. As an analysis of the database shows, lexical stems need to have at least one syllable and they usually begin with a consonant. However, some lexical stems beginning with the vowel ‘a’ are also attested. These constraints have been integrated into the morphological guesser. The guessed lexical stems may also undergo phonological changes. This means that different guesses may be possible for one input string. For example, if the lexical stem ngkardu ‘see’ were unknown, the possible guesses for the input string bamkardu would include both kardu and ngkardu as lexical stems in combination with the classifier stem 13. Because the phonological rules give rise to multiple guesses, the morphological analyzer has to keep the second and third lookup strategies as separate steps.
The guesser could also cover the known lexical stems in unknown combinations, but due to the phonological rules, the guesser produces various possible analyses. Separating the two steps by first relaxing the constraints on the classifier and lexical stem combinations and then guessing unknown lexical stems gives priority to already existing lexical stems. It thus limits the number of possible analyses for input strings with a known lexical stem. A morphological guesser only makes sense for lexical items which obligatorily combine with other morphemes, as these morphemes serve as a restrictor for the various guesses. It is therefore helpful if all the dependencies between the morphemes are implemented for the guesser as well. For example, in (5.22), the lexical stem can be guessed as wirnturt because there is a restriction that -neme can only attach to dual classifier stems. The guessed lexical stem cannot be kawirnturt, for example: although a classifier stem form perrem exists, it is the plural form of the classifier stem poke:rr(21). The guesser is restricted to detecting further Murrinh-Patha verbs. No guesser has been implemented for the other parts of speech as these do not have to be inflected. Cross-linguistically, nouns are natural candidates for guessers as they usually form a large open class, and a morphological guesser for nouns would therefore be desirable in principle. However, nouns in Murrinh-Patha do not have to be case-marked or carry other inflection. For this reason, a guesser for nouns is not helpful: every input string could then be analyzed as a noun, which is not desirable.

[7] As is discussed in Section 5.4.2, wirnturt is just a spelling alternative to the lexical stem wirndurt that is listed in Street (1989).

Figure 5.1: Stepwise lookup strategy for the Murrinh-Patha morphological analyzer for the development corpus.
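How the surrounding morphemes restrict the guesses can be illustrated with example (5.22). The following Python sketch is a simplification under stated assumptions: it ignores the phonological rules, covers only the two classifier forms and the one number marker mentioned in the text, treats a, e, i and u as the vowels of the orthography, and uses a dictionary layout that is purely illustrative rather than the lexc placeholder mechanism.

```python
import re

# Classifier forms from example (5.22). The number values follow the text;
# the gloss string for perrem and the data layout are illustrative assumptions.
CLASSIFIER_FORMS = {
    "perremka": ("3duS.poke:rr(21).nfut", "du"),
    "perrem": ("plural form of poke:rr(21)", "pl"),
}
# Number marker -> (gloss, classifier number that licenses it). Simplified:
# licensing by object markers is ignored here.
NUMBER_MARKERS = {"neme": ("pauc.m", "du")}

VOWELS = "aeiu"  # assumed vowel letters of the orthography
# A stem starts with a consonant or with the vowel 'a'.
STEM_SHAPE = re.compile(rf"^(?:a|[^{VOWELS}])[a-z]*$")


def plausible_stem(candidate):
    """At least one syllable (one vowel) and a permitted initial segment."""
    return bool(STEM_SHAPE.match(candidate)) and any(ch in VOWELS for ch in candidate)


def guess(word):
    """Strip a known classifier form and number marker; keep licensed remainders."""
    guesses = []
    for clf_form, (clf_gloss, clf_number) in CLASSIFIER_FORMS.items():
        if not word.startswith(clf_form):
            continue
        for marker, (marker_gloss, required_number) in NUMBER_MARKERS.items():
            if not word.endswith(marker):
                continue
            stem = word[len(clf_form):len(word) - len(marker)]
            if clf_number == required_number and plausible_stem(stem):
                guesses.append((clf_gloss, stem, marker_gloss))
    return guesses


print(guess("perremkawirnturtneme"))
```

In this sketch the split perrem + kawirnturt is discarded because perrem is plural and -neme requires a dual classifier stem, which leaves wirnturt as the only guess. Removing the number check would let both candidates through, which is the overgeneration the dependencies are meant to prevent.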
The third lookup strategy thus only comprises a guesser for lexical stems. The stepwise lookup strategy as described so far is illustrated in Figure 5.1. The lookup strategy makes the morphological analyzer robust without the danger of overgenerating. Thus, if the system is used to analyze a corpus, it can detect new combinations of classifier and lexical stems as well as new lexical stems. The results of the morphological analysis of the word list of the development corpus with the described stepwise morphological analyzer are presented in Table 5.5.

strategy 1:        4644 words    82.55%
strategy 2:         241 words     4.28%
strategy 3:         686 words    12.19%
not found:           55 words     0.98%
corpus size:       5626 words
execution time:       1 sec
speed:             5626 words/sec

Table 5.5: Output of the morphological analyzer with three strategies for the development corpus word lists: strategy 1 involves full constraints, strategy 2 has relaxed constraints on classifier and lexical stems and strategy 3 guesses unknown lexical stems.

As can be seen in the table, the overall coverage of the morphological analyzer is already very satisfactory: only 0.98% of the words could not be given an analysis. However, the morphological guesser applies to 12.19% of the words, which is quite high. When looking at the output of the morphological guesser in more detail, it becomes apparent that the morphological guesser provides guesses for lexical stems in verbs which could also be analyzed differently. This is the case for some morpheme orderings which have not been described before, or which were described as ungrammatical. For example, in (5.23), the adverb deyida ‘again’ occurs before the tense marker -dha, whereas according to the verbal template it should only be able to attach after the tense marker. As a result, the lexical stem guessed by the morphological analyzer is ngerrendeyida.
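The percentages in Table 5.5 follow directly from the word counts. As a quick consistency check, the shares can be recomputed as below; the counts are taken from the table, while the variable names are only illustrative.

```python
# Word counts per lookup step, as reported in Table 5.5.
counts = {
    "strategy 1": 4644,
    "strategy 2": 241,
    "strategy 3": 686,
    "not found": 55,
}

total = sum(counts.values())  # should equal the reported corpus size of 5626

# Share of the word list handled by each step, in percent, two decimals.
shares = {step: round(100 * n / total, 2) for step, n in counts.items()}

# Overall coverage is everything that received some analysis.
coverage = round(100 * (total - counts["not found"]) / total, 2)

for step, share in shares.items():
    print(f"{step}: {counts[step]} words, {share}%")
print(f"overall coverage: {coverage}%")
```

The counts sum to the reported corpus size, and the overall coverage of the three strategies together comes to 99.02%.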
(5.23) kardi-ngerren-deyida-dha
       3sgS.be(4).pimpf-talk-again-pimpf

To be able to provide a more coherent analysis of such unexpected morpheme orderings, another strategy has been added to the stepwise morphological analyzer. These new morpheme orderings, as well as some relaxed constraints on the morpheme cooccurrences, have been given a separate lookup step rather than being included in the basic morphological analyzer, so that these cases can be considered separately. In the following, the morpheme orderings and relaxed constraints for this new lookup strategy are described briefly. As has been described above, lookup strategy 2 relaxes the constraints on the combination of classifier and lexical stems to detect new combinations of classifier and lexical stems. In the new lookup strategy, the constraints on the tense markers and on the number markers have been relaxed. For tense markers, Nordlinger & Caudal (2012) describe the known combinations of classifier stem inflection and separate tense markers. However, the corpus contains some instances of verbs which do not follow the description in Nordlinger & Caudal (2012). An example is provided in (5.24). Both the past imperfective and the past irrealis form of the classifier stem require the additional tense marker -dha, which is missing in this case.

(5.24) bina-na-yepup
       3sgS.16.pimpf/pstirr-3sg.ben-listen

Similarly, the corpus contains instances of the number markers -nintha, -ngintha, -neme and -ngime which are not licensed by the classifier stem or the object markers. As has been discussed above, the constraints on these number markers are quite complex. They can be licensed by the classifier stem (-nintha/-ngintha can only occur with singular classifier stems and -neme/-ngime can only occur with the dual classifier stem) or by certain direct or indirect object markers.
In (5.25), the dual non-sibling female marker -ngintha is not licensed because the classifier stem is either dual or plural, and the second person singular indirect object marker -mpa does not normally combine with an additional number marker either.

(5.25) ngira-mpa-winhadhath-tha-ngintha
       1du/plS.watch(28).pimpf-2sg.ben-look.for-pimpf-du.f

To be able to provide these and similar ungrammatical or at least undescribed combinations of morphemes with an analysis, the constraints on the number and tense markers have been relaxed in the new lookup strategy. Besides relaxed constraints, some undescribed morpheme orderings are also allowed in this lookup strategy. Firstly, the number markers, most often -neme, occur at the end of the verb, even after discourse marking etc.[8] In (5.26), the number marker -neme occurs after the emphatic discourse marker -wa. This verb-final placement of the number marker was quite common in the corpus.

(5.26) pume-berti-dha-mana-wa-neme
       3du/plS.hands(8).pimpf-take-pimpf-only-dm-pauc.m

[8] The discourse markers were not part of the verbal template presented in Figure 5.1 as it seems that they can attach to the end of almost every word.

Secondly, the incorporated adverb deyida ‘again’ occurred in unexpected slots, namely before the tense marker, as was already shown in (5.23). In the new lookup strategy, an additional slot for deyida between the lexical stem and the tense marker has been added. Thirdly, verbs seem to be able to take case marking, most often the case marker -re/te. In most instances, these case markers occur before the discourse markers, as in (5.27).

(5.27) a. nuru-rdurr-nu-re-ka
          2plS.go(6).futirr-leave-fut-erg/instr-dm
       b. nuru-lili-nu-re-yu
          2plS.go(6).futirr-walk-fut-erg/instr-dm

However, examples in which the case marker attaches after the discourse marker can be found as well. This ordering is also found for other parts of speech.
For example, in (5.28), the pronoun nukunu occurs with the discourse marker -wa and the dative marker -nu.
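The additional slot for deyida described above can be pictured as a second, more permissive template that is only tried after the basic one fails. The Python sketch below is a deliberately reduced illustration: the slot labels and the two regular expressions are hypothetical simplifications of the verbal template, only the morphemes of example (5.23) are covered, and the second test string is constructed to follow the expected order rather than taken from the corpus.

```python
import re

# Simplified, hypothetical slot templates written over slot labels.
# BASIC: the adverb may only follow the tense marker.
BASIC = re.compile(r"^CLF LEX( TNS)?( ADV)?$")
# RELAXED: adds an adverb slot between lexical stem and tense marker.
RELAXED = re.compile(r"^CLF LEX( ADV)?( TNS)?( ADV)?$")

# Slot assignment for the morphemes of example (5.23).
SLOT_OF = {"kardi": "CLF", "ngerren": "LEX", "deyida": "ADV", "dha": "TNS"}


def slots(segmented):
    """Map a hyphen-segmented verb to its sequence of slot labels."""
    return " ".join(SLOT_OF[morph] for morph in segmented.split("-"))


def first_matching_template(segmented):
    """Try the basic template first and fall back to the relaxed one."""
    sequence = slots(segmented)
    if BASIC.match(sequence):
        return "basic"
    if RELAXED.match(sequence):
        return "relaxed"
    return None


print(first_matching_template("kardi-ngerren-deyida-dha"))  # attested order of (5.23)
print(first_matching_template("kardi-ngerren-dha-deyida"))  # constructed, template order
```

Keeping the relaxed template as a separate step means that verbs following the described ordering are never analyzed via the undescribed one, and the unexpected orderings can be collected and inspected on their own.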
